Introduction
nf-core RNA-seq v.3.12.0 STAR-Salmon :
This pipeline performs read alignment and transcript quantification of
bulk RNA-seq data. Based on nf-core RNA-seq v.3.12.0
STAR-Salmon, a bioinformatics community project for making
computational methods portable and reproducible, it runs with Nextflow workflow
manager, and has been optimized for differential expression analysis at
the gene level. To demonstrate this pipeline, bulk mRNA-seq dataset with
biological replicates and technical replicates from a publication is
used (cited below). Human genome reference version GRCh38 primary
assembly and annotation version Gencode v44 Basic are used. For more
information on Gencode Basic: click
here
Input files :
* example_DesignFile.csv : Sample names, (relative or absolute)
paths to data files, and NGS
library strandedness according to the
nf-core guideline.
* example_nf-params.json : Pre-defined nf-core parameters for
RNA-seq_v1.1 pipeline. It includes paths to the reference genome
(.fasta) and annotation (.gtf) files.
* example_run_nextflow.sh : Bash script file for initiating a
Nextflow run (a.k.a your magic button). You can adjust the
‘wdir’, ‘proj’ and ‘run’ parameters before initiating the run.
Required changes on example_run_nextflow.sh before
initiating your run :
Change the three parameters and save the file. Use a descriptive run
name.
* wdir = [working_directory]
* proj = [project_title] e.g. “rnaseq”
* run = [run_name] e.g. “star_Salmon_basic_ylim”
Organize your project as [working_directory]/[project_title]/[run_name]/
so you can save the results from multiple runs under a single project
folder. Now this is where your “work” and “result” folders will be
created from the RNA-seq_v1.1 run.
Before running the pipeline :
* Check your raw data size & estimate your project folder size : $
du -hsc /path/to/raw/.fastq.gz
Check storage space in the server : $ df -H /home/lety/data/
* Coordinate with other users for optimal memory usage : $ htop
Output files :
* Quality control (.html) :
/result/multiqc/star_salmon/multiqc_report.html
* Alignment files (.markdup.sorted.bam) :
/result/star_salmon/[sample_name].markdup.sorted.bam
* Count files merged for all samples (.tsv) :
/result/star_salmon/salmon.merged.gene_counts.tsv
* Count files per sample (.sf) :
/result/star_salmon/[sample_name]/quant.genes.sf
BAM file validation :
We found truncated BAM files even with successful completion message
from the Nextflow run. It is because the server storage got full at the
time of writing the files. Such error is not detectable by the original
pipeline, hence additional checks on all BAM files are now added in
example_run_nextflow.sh. Following files can be found in
/result/star_salmon/picard_validatesamfile.
* bad_bams.fofn : quick check whether the BAM files have EOF
(end-of-file) marks.
* bam_summary.txt : full output of Picard ValidateSamFile
* error_messages.txt : file name and detected error and warning messages
extracted from bam_summary.txt. Please check the GATK
troubleshooting page for what they mean.
After running the pipeline :
* Please check execution report, multiqc report, bam file validation
result to assess successful completion of all the stages.
* Please remove scratch/ and work/ directories in the server.
File tree :
├── [working_directory]/
│…. ├── example_DesignFile.csv
│…. ├── example_nf-params.json
│…. ├── example_run_nextflow.sh
│…. ├── CBBI_RNAseq_manual.rmd
│…. ├── [project_title]/
│…. │…. ├── [run_name]/
│…. │…. │…. ├── result/
│…. │…. │…. │…. ├── pipeline_info/
│…. │…. │…. │…. │…. ├── execution_report_yyyy-mm-dd_hh-mm-ss.html
│…. │…. │…. │…. ├── multiqc/
│…. │…. │…. │…. │…. ├── star_salmon/
│…. │…. │…. │…. │…. │…. ├── multiqc_report.html
│…. │…. │…. │…. ├── star_salmon/
│…. │…. │…. │…. │…. ├── [sample_name].markdup.sorted.bam
│…. │…. │…. │…. │…. ├── [sample_name].Aligned.out.bam
│…. │…. │…. │…. │…. ├── salmon.merged.gene_counts.tsv
│…. │…. │…. │…. │…. ├── salmon.merged.gene_tpm.tsv
│…. │…. │…. │…. │…. ├── [sample_name]/
│…. │…. │…. │…. │…. │…. ├── quant.genes.sf
│…. │…. │…. │…. │…. │…. ├── quant.sf
│…. │…. │…. │…. │…. ├── picard_validatesamfile/
│…. │…. │…. │…. │…. │…. ├── error_messages.txt
│…. │…. │…. ├── work/
│…. │…. │…. ├── scratch/
│…. │…. │…. ├── DEAoutput/
│…. │…. │…. │…. ├── Counts.csv
│…. │…. │…. │…. ├── TPM.csv
│…. │…. │…. │…. ├── plot_*.pdf
│…. │…. │…. │…. ├── DEA_all_genes.csv
│…. │…. │…. │…. ├── DEA_significant_genes.csv
│…. │…. │…. │…. ├── DEA_all_genes_IPA.txt
│…. │…. │…. │…. ├── fgsea_result.txt
│…. │…. │…. │…. ├── fgsea_result_main.txt
Set up for DEA
This R Markdown documents (.RMD) file reads in the output files of the RNA-seq_v1.1 pipeline and detects differentially expressed genes using DESeq2 package. It is assumed that this file is in the same [working_directory]. Any figures and gene lists from the analysis would be saved under [working_directory]/[project_title]/[run_name]/DEAoutput.
seed <- 20230816
## install pacman library if not installed
if ( !require("pacman", character.only = TRUE)) {install.packages("pacman", dependencies = TRUE, quiet = TRUE) }
## load and install libraries using pacman
pacman::p_load("dplyr", "ggplot2", "ggfortify", "DESeq2", "IHW", "apeglm", "ashr", "xlsx", "pheatmap", "RColorBrewer", "gridExtra", "ggvenn", "dendextend", "seriation", "Rtsne", "tximport", "tidyr", "plyr", "ggvenn", "ggrepel", "plotly", "ggplotify", "goseq", "stringr", "msigdbr", "clusterProfiler", "fgsea", "data.table", "corrplot", "GenomicTools.fileHandler", "AnnotationDbi", "org.Hs.eg.db", "DT")
'%notin%'<- Negate('%in%')
## aesthetics
color <- list(Condition = c("cornflowerblue","orange2"),
Replicate = c("royalblue4","mediumseagreen","mediumorchid3"),
Extra = c("darkgreen", "firebrick3", "darkcyan", "deeppink2"))
Theme.YL <- ggplot2::theme(text = element_text(size = 14), # family = "Sans",
plot.title = element_text(size = rel(1.3), hjust=0.5),
plot.subtitle = element_text(size = rel(1), hjust=0.5),
# axis.title.x = element_text(size = rel(1)),
# axis.title.y = element_text(size = rel(1)),
legend.text = element_text(size = rel(0.8)),
panel.background = element_rect(fill = 'white', colour = NA),
panel.border = element_rect(fill = NA, colour = 'grey20'),
strip.background = element_rect(colour = 'grey20', fill = 'white'),
panel.grid.major = element_line(colour = 'grey70', linetype = 'dotted'),
plot.margin = unit(c(2,2,2,2), "cm"))
Create SampleSheet
SampleSheet contains sample name, various file paths, condition (i.e. CC_CR and CC), biological replicates (i.e. rep1, rep2, and rep3), and technical replicates (i.e. s1, s2, s3, and s4).
wdir <- here::here() # working directory
if (!dir.exists(wdir)) {
stop("Directory does not exist: ", wdir)
}
proj <- "rnaseq" # your project title
runname <- "star_Salmon_basic_ylim" # descriptive run name
outdir <- paste0(wdir, "/", "DEAoutput/") #"DEAoutput/" # where DEA results will be saved
## create output directory if it does not exist yet
ifelse(!dir.exists(file.path(outdir)), dir.create(file.path(outdir)), FALSE)
## [1] FALSE
## create SampleSheet from the design file
SampleSheet <- read.csv2(paste0(wdir, "/", "example_DesignFile.csv"), sep = ",")
## condition
SampleSheet$condition <- unlist(lapply(strsplit(SampleSheet$sample, "_"),
function(x) paste(x[1:(length(x)-2)], collapse = "_")))
SampleSheet <- SampleSheet %>% mutate(condition = as.factor(condition))
## biological replicate
SampleSheet$bio_rep <- unlist(lapply(strsplit(SampleSheet$sample, "_"),
function(x) ifelse(length(x)==4, paste(x[1:(length(x)-1)], collapse = "_"), paste(x[1:(length(x)-1)], collapse = "_"))))
## technical replicate
SampleSheet$tech_rep <- unlist(lapply(strsplit(SampleSheet$sample, "_"),
function(x) ifelse(length(x)==4, paste(x[3:(length(x)-1)], collapse = "_"), paste(x[2:(length(x)-1)], collapse = "_"))))
## path to the count files
SampleSheet$Salmon_basic_gene_counts <- unlist(lapply(SampleSheet$sample,
function(x) paste0(wdir, "/", proj, "/", runname, "/result/star_salmon/",x,"/quant.genes.sf")))
SampleSheet$Salmon_basic_transcript_counts <- unlist(lapply(SampleSheet$sample,
function(x) paste0(wdir, "/", proj, "/", runname, "/result/star_salmon/",x,"/quant.sf")))
## SampleSheet
# DT::datatable(SampleSheet,
# caption = htmltools::tags$caption(style = 'caption-side: top; text-align: left;', htmltools::em('SampleSheet')))
Load in count matrix
Import the output of Salmon (i.e. quant.genes.sf), which includes relative abundance (=TPM, Transcripts Per Million), estimated counts, and effective length.
## Import the count files using tximport. tximport will automatically group the data on predefined groups. Filepaths and SampleTitle must be in the same order!
GetCounts_mRNA <- function(FilePaths, SampleTitle, QuantType) {
genIdCol <- ifelse(QuantType == "rsem", "gene_id", "Name")
Temp.Files <- FilePaths
names(Temp.Files) <- SampleTitle
tximport_Counts <- tximport::tximport(Temp.Files, type=QuantType, txIn = F, txOut = F, geneIdCol = genIdCol)
tximport_Counts[["length"]][which(tximport_Counts[["length"]] == 0)] <- 1
return(tximport_Counts)
}
## gene-level.
Salmon.basic.Counts <- GetCounts_mRNA(SampleSheet$Salmon_basic_gene_counts, SampleSheet$sample, "salmon")
## transcript-level.
# Salmon.basic.trans.Counts <- GetCounts_mRNA(SampleSheet$Salmon_basic_transcript_counts, SampleSheet$sample, "salmon")
## save TPM and Counts into CSV files
write.csv(Salmon.basic.Counts$abundance, file = paste0(outdir,"TPM.csv"))
write.csv(Salmon.basic.Counts$counts, file = paste0(outdir,"Counts.csv"))
Inspect the data
Distribution of genes with zero count across all samples. The genes that are not detected in all 24 samples can be filtered out before performing DESeq.
## row-wise (per gene) frequency of zeros
df <- data.frame(Salmon.basic.Counts$counts)
df$numZero <- rowSums(df == 0)
df$row.names <- rownames(df)
ggplot(df[,c('row.names', 'numZero')], aes(x=numZero)) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3))) +
stat_bin(binwidth=1, center=0, geom="text", aes(label=after_stat(count)), vjust=-0.5, closed="left") +
labs(title = "Distribution of genes with zero counts", x = "Number of zero counts across samples", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5)) + Theme.YL
# Differential Expression Analysis ## The two conditions : CC-CR vs. CC
“Cetuximab and panitumumab are epidermal growth factor receptor (EGFR)
monoclonal antibodies (mAbs) that bind the extracellular domain of EGFR
and enhance receptor internalization and degradation. These EGFR mAbs
are common targeted agents for patients with wild-type KRAS metastatic
colorectal cancer (CRC). … We generated a cell line from colonies with
cystic morphology, which we designated cystic colonies (CC) and
cetuximab-resistant CC (CC-CR). CC tumors were well differentiated and
regressed following administration of cetuximab. CC-CR tumors were
poorly differentiated and continued to grow in the presence of
cetuximab, although not to the extent of untreated tumors.”
## Create colData
coldata <- SampleSheet[,c("sample","condition","bio_rep", "tech_rep")]
## factorize $condition with "CC" condition as reference
coldata$condition <- factor(coldata$condition)
coldata$condition <- relevel(coldata$condition, ref = "CC")
rownames(coldata) <- coldata$sample
coldata$sample <- NULL
## Reorder counts to the coldata
Salmon.basic.Counts.Ordered <- lapply(Salmon.basic.Counts[1:length(Salmon.basic.Counts)-1], function(x) data.frame(x) %>% dplyr::select(rownames(coldata)) %>% as.matrix())
Salmon.basic.Counts.Ordered$countsFromAbundance <- Salmon.basic.Counts$countsFromAbundance
## create DDS (DESeq Data Set)
dds.Salmon.basic <- DESeq2::DESeqDataSetFromTximport(txi = Salmon.basic.Counts.Ordered,
colData = coldata,
design = ~condition)
dds.Salmon.basic
## class: DESeqDataSet
## dim: 62754 24
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(62754): ENSG00000278625.1 ENSG00000276017.1 ...
## ENSG00000112062.12 ENSG00000258289.10
## rowData names(0):
## colnames(24): CC_rep1_s1 CC_rep1_s2 ... CC_CR_rep3_s3 CC_CR_rep3_s4
## colData names(3): condition bio_rep tech_rep
Pre-filtering
## filter out the genes with zero count across all samples
keep <- rowSums(counts(dds.Salmon.basic)) > 0
dds.Salmon.basic <- dds.Salmon.basic[keep,]
## (optional) filter out mitochondrial genes
# Create a mapping from Ensembl gene IDs to gene names so we can remove MT- genes
# gtf.path <- ""
gtf.data <- GenomicTools.fileHandler::importGTF(gtf.path, level="gene", features=c("gene_id", "gene_name", "gene_type"))
## Automatically detected number of rows to skip: 5
## List of features in column 9:
## -----------------------------
## gene_id
## gene_type
## gene_name
## level
## tag
gene.map <- gtf.data[, c("gene_id", "gene_name")]
# Merge DESeqDataSet gene IDs with gene names from GTF
gene.map <- merge(data.frame(gene_id = rownames(dds.Salmon.basic)),
gene.map,
by = "gene_id", all.x = TRUE)
# Identify mitochondrial genes by gene name (adjust this according to your dataset)
mito.genes <- grep("^MT-", gene.map$gene_name, value = TRUE)
# Filter out mitochondrial genes from the DESeqDataSet
dds.Salmon.basic.filtered <- dds.Salmon.basic[!rownames(dds.Salmon.basic) %in%
gene.map$gene_id[gene.map$gene_name %in% mito.genes], ]
## (optional) pre-filtering based on biotypes (gene_type) from the gtf file
## https://www.gencodegenes.org/pages/biotypes.html
biotype.map <- gtf.data[, c("gene_id", "gene_type")]
unique(biotype.map$gene_type)
## [1] "lncRNA" "transcribed_unprocessed_pseudogene"
## [3] "unprocessed_pseudogene" "miRNA"
## [5] "protein_coding" "processed_pseudogene"
## [7] "snRNA" "transcribed_processed_pseudogene"
## [9] "misc_RNA" "TEC"
## [11] "transcribed_unitary_pseudogene" "snoRNA"
## [13] "scaRNA" "rRNA_pseudogene"
## [15] "unitary_pseudogene" "pseudogene"
## [17] "rRNA" "IG_V_pseudogene"
## [19] "scRNA" "IG_V_gene"
## [21] "IG_C_gene" "IG_J_gene"
## [23] "sRNA" "ribozyme"
## [25] "translated_processed_pseudogene" "vault_RNA"
## [27] "TR_C_gene" "TR_J_gene"
## [29] "TR_V_gene" "TR_V_pseudogene"
## [31] "TR_D_gene" "IG_C_pseudogene"
## [33] "TR_J_pseudogene" "IG_J_pseudogene"
## [35] "IG_D_gene" "IG_pseudogene"
## [37] "artifact" "Mt_tRNA"
## [39] "Mt_rRNA"
(optional) pre-filtering to keep only rows that have a count of at least 10 for a minimal number of samples (optional) pre-filtering lowly expressed genes log(tpm+1) > 0.5
## row-wise (per gene) frequency of zeros
df <- data.frame(counts(dds.Salmon.basic.filtered))
df$numZero <- rowSums(df == 0)
df$row.names <- rownames(df)
ggplot(df[,c('row.names', 'numZero')], aes(x=numZero)) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3))) +
stat_bin(binwidth=1, center=0, geom="text", aes(label=after_stat(count)), vjust=-0.5, closed="left") +
labs(title = "Distribution of genes with zero counts", x = "Number of zero counts across samples", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5)) + Theme.YL
Perform DEA & DESeq2 results
## perform DESeq
dds.Salmon.basic.filtered <- DESeq2::DESeq(dds.Salmon.basic.filtered)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 57 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
- “CC” condition as control & FDR < 0.05.
- cooksCutoff = TRUE (default): This resets p-values of outlier genes to NA, removing their influence on the results.
- independentFiltering = TRUE (default): This filters out genes with low counts or high variability, helping to focus on more reliable result.
- filterFun = ihw: This applies the ‘Independent hypothesis weighting’ method to adjust p-values for multiple comparisons, reducing false positives.
## DESeq2 results
FDR = 0.05
res.Salmon.basic <- DESeq2::results(dds.Salmon.basic.filtered,
pAdjustMethod = "BH", alpha = FDR,
contrast = c("condition", "CC_CR", "CC"),
cooksCutoff = TRUE,
independentFiltering = TRUE,
filterFun=ihw)
## Column names of the DESeq2 results
data.frame(mcols(res.Salmon.basic, use.names=T))
## type description
## baseMean intermediate mean of normalized counts for all samples
## log2FoldChange results log2 fold change (MLE): condition CC_CR vs CC
## lfcSE results standard error: condition CC CR vs CC
## stat results Wald statistic: condition CC CR vs CC
## pvalue results Wald test p-value: condition CC CR vs CC
## padj results Weighted BH adjusted p-values
## weight results IHW weights
## Summary of the DESeq result
summary(res.Salmon.basic)
##
## out of 34346 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 2545, 7.4%
## LFC < 0 (down) : 2543, 7.4%
## outliers [1] : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
## How many adjusted p-values were less than 0.05?
sum(res.Salmon.basic$padj < FDR, na.rm=TRUE)
## [1] 5088
## Meta data
metadata(res.Salmon.basic)
## $alpha
## [1] 0.05
##
## $ihwResult
## ihwResult object with 34378 hypothesis tests
## Nominal FDR control level: 0.05
## Split into 22 bins, based on an ordinal covariate
##
## $lfcThreshold
## [1] 0
Add gene names to the result using the same gtf file
res.Salmon.basic$gene <- gtf.data$gene_name[match(rownames(res.Salmon.basic), gtf.data$gene_id)]
res.Salmon.basic$ensembl <- gsub('\\..+$', '', rownames(res.Salmon.basic))
res.Salmon.basic$entrez <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys=res.Salmon.basic$ensembl, column="ENTREZID", keytype="ENSEMBL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
Data Visualization
Log fold change (LFC) shrinkage
Shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes. ‘apeglm’ and ‘ashr’ are preferred method over ‘normal’. Please read more about each method from the original papers (cited below) as well as from the Alternative shrinkage estimators and Extended section on shrinkage estimators. As for deciding between apeglm vs. ashr, see this thread and this thread.
## LFC shrinkage methods : type = c("apeglm", "ashr", "normal")
resLFC.Salmon.basic.apeglm <- DESeq2::lfcShrink(dds.Salmon.basic.filtered, res = res.Salmon.basic, type = "apeglm",
coef = "condition_CC_CR_vs_CC")
## Warning in nbinomGLM(x = x, Y = YNZ, size = size, weights = weightsNZ, offset =
## offsetNZ, : the line search routine failed, unable to sufficiently decrease the
## function value
resLFC.Salmon.basic.ashr <- DESeq2::lfcShrink(dds.Salmon.basic.filtered, res = res.Salmon.basic, type = "ashr")
resLFC.Salmon.basic.normal <- DESeq2::lfcShrink(dds.Salmon.basic.filtered, res = res.Salmon.basic, type = "normal",
coef = "condition_CC_CR_vs_CC")
## reorder by adjusted p-value
resLFC.Salmon.basic.apeglm <- resLFC.Salmon.basic.apeglm[order(resLFC.Salmon.basic.apeglm$padj),]
resLFC.Salmon.basic.ashr <- resLFC.Salmon.basic.ashr[order(resLFC.Salmon.basic.ashr$padj),]
resLFC.Salmon.basic.normal <- resLFC.Salmon.basic.normal[order(resLFC.Salmon.basic.normal$padj),]
## MA plot before/after shrinkage
Plot.MA <- function(ma.data, baseMeanCutoff = 10, FDR = 0.05, title = "", colSig = "black") {
ma <- data.frame(ma.data)
ma <- ma %>% dplyr::mutate(isDE = case_when(
abs(log2FoldChange) > 0.5 & baseMean >= baseMeanCutoff & padj < FDR ~ TRUE,
TRUE ~ FALSE),
isDE = as.factor(isDE))
ma <- ma %>% dplyr::arrange(isDE)
ma.plot <- ma %>% data.frame() %>% ggplot2::ggplot(aes(x=baseMean, y=log2FoldChange, color=isDE, shape=isDE, fill=isDE)) +
geom_point(aes(shape = isDE, color=isDE), size = 3) +
scale_x_continuous(trans = "log10") +
scale_color_manual(values = c("grey80", colSig)) +
scale_shape_manual(values= c(19,16)) +
geom_vline(xintercept = 50, color = 'grey50', linetype = "dashed") +
geom_hline(yintercept = c(-.5, .5), color = 'grey50', linetype = "dashed") +
Theme.YL + ggtitle(label = title)
plot(ma.plot)
}
Plot.MA(res.Salmon.basic, title = paste("MA plot (before shrinkage)",
"min:", round(range(res.Salmon.basic$log2FoldChange),2)[1],
"max:", round(range(res.Salmon.basic$log2FoldChange),2)[2]),
colSig = color$Extra[1])
Plot.MA(resLFC.Salmon.basic.apeglm, title = paste("MA plot (after apeglm shrinkage)",
"min:", round(range(resLFC.Salmon.basic.apeglm$log2FoldChange),2)[1],
"max:", round(range(resLFC.Salmon.basic.apeglm$log2FoldChange),2)[2]),
colSig = color$Extra[2])
Plot.MA(resLFC.Salmon.basic.ashr, title = paste("MA plot (after ashr shrinkage)",
"min:", round(range(resLFC.Salmon.basic.ashr$log2FoldChange),2)[1],
"max:", round(range(resLFC.Salmon.basic.ashr$log2FoldChange),2)[2]),
colSig = color$Extra[3])
Plot.MA(resLFC.Salmon.basic.normal, title = paste("MA plot (after normal shrinkage)",
"min:", round(range(resLFC.Salmon.basic.normal$log2FoldChange),2)[1],
"max:", round(range(resLFC.Salmon.basic.normal$log2FoldChange),2)[2]),
colSig = color$Extra[4])
Save DEA result
Save DEA result - Shrunk LFC
## choose one shrinkage method
resLFC.Salmon.basic <- resLFC.Salmon.basic.normal
rm(resLFC.Salmon.basic.apeglm, resLFC.Salmon.basic.ashr)
write.table(resLFC.Salmon.basic, file = paste0(outdir,"DEA_all_genes_shrunkLFC.tsv"), sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
IPA formatted file - after shrinkage of LFC
ipa <- resLFC.Salmon.basic[,c("gene","ensembl","entrez", "baseMean", "log2FoldChange","pvalue", "padj")]
write.table(ipa, file = paste0(outdir,"DEA_all_genes_IPA.txt"), sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
Volcano Plot
A volcano plot allows quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.
mean.cutoff = 10
lfc.cutoff = 0.5
threshold <- resLFC.Salmon.basic$padj < FDR &
resLFC.Salmon.basic$baseMean > mean.cutoff &
abs(resLFC.Salmon.basic$log2FoldChange) > lfc.cutoff
df <- data.frame(resLFC.Salmon.basic)
df$Significant <- threshold
df$genelabels <- rownames(df) %in% rownames(df)[df$Significant == TRUE][1:10]
p <- ggplot(df) +
geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = Significant, alpha = Significant, size = Significant)) +
scale_color_manual(values=color$Extra) +
geom_label_repel(aes(x = log2FoldChange, y = -log10(padj), label = ifelse(genelabels == T, df$gene,""))) +
labs(title = "Volcano Plot (CC_CR vs. CC)",
subtitle = paste0("10 significant genes are labelled. LFC cutoff = ", lfc.cutoff,
", p-value cutoff = ", FDR, ", baseMean cutoff = ", mean.cutoff)) +
xlab("log2 fold change") + ylab("-log10 adjusted p-value") +
geom_vline(xintercept = c(lfc.cutoff, -lfc.cutoff), colour = "grey50", linetype = 'dashed') +
geom_hline(yintercept = FDR, colour = "grey50", linetype = 'dashed') +
Theme.YL + scale_size_manual(values = c(1,2))
plot(p)
## Warning: Using alpha for a discrete variable is not advised.
## Warning: Use of `df$gene` is discouraged.
## ℹ Use `gene` instead.
## png
## 2
PlotCounts
Check out counts for a gene of your interest. Normalized counts plus a pseudocount of 0.5 are shown by default.
gene.interest.name = "MIR100HG" # high variability within CC_CR group
gene.interest = rownames(resLFC.Salmon.basic[resLFC.Salmon.basic$gene == gene.interest.name,])
d <- plotCounts(dds.Salmon.basic.filtered, gene = gene.interest, intgroup = "condition", returnData = TRUE)
d$name <- rownames(d)
d$tech_rep <- SampleSheet$tech_rep
ggplot(d, aes(x=condition, y=count, color=tech_rep)) + scale_color_manual(values=color$Replicate) + scale_fill_manual(values=color$Condition) +
geom_point(position=position_jitter(w=0.1,h=0)) + geom_boxplot(aes(group = condition, fill= condition),alpha = 0.3) + #geom_text_repel(aes(label = name)) +
ggtitle(paste(gene.interest.name, "")) + Theme.YL
gene.interest.name = "RPL5" # low variability within groups
gene.interest = rownames(resLFC.Salmon.basic[resLFC.Salmon.basic$gene == gene.interest.name,])
d <- plotCounts(dds.Salmon.basic.filtered, gene = gene.interest, intgroup = "condition", returnData = TRUE)
d$name <- rownames(d)
d$tech_rep <- SampleSheet$tech_rep
ggplot(d, aes(x=condition, y=count, color=tech_rep)) + scale_color_manual(values=color$Replicate) + scale_fill_manual(values=color$Condition) +
geom_point(position=position_jitter(w=0.1,h=0)) + geom_boxplot(aes(group = condition, fill= condition),alpha = 0.3) + #geom_text_repel(aes(label = name)) +
ggtitle(paste(gene.interest.name, "")) + Theme.YL
Clustering anlaysis
Data Transformation
For visualization or clustering, it might be useful to work with transformed versions of the count data. The regularized log (rlog) function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. The rlog transformation produces a similar variance stabilizing effect as Variance Stabilizing Transformation (vst), though rlog is more robust in the case when the size factors vary widely.
TransformCounts <- function(ddsMatrix, TransformationType = "rlog", blind = TRUE, fitType = "parametric") {
## user can choose between rlog and vst as their data transformation method
## check if the ddsMatrix contains already calculated dispersion values or not
if (sum(is.na(SummarizedExperiment::rowData(ddsMatrix)[,"dispersion"])) == 0) {
transformedCounts <- switch(
TransformationType,
"rlog" = DESeq2::rlog(ddsMatrix, fitType = fitType, blind = blind),
"vst" = DESeq2::varianceStabilizingTransformation(ddsMatrix, fitType = fitType, blind = blind))
} else {
transformedCounts <- switch(
TransformationType,
"rlog" = DESeq2::rlog(estimateDispersions(estimateSizeFactors(ddsMatrix)), fitType = fitType, blind = blind),
"vst" = DESeq2::varianceStabilizingTransformation(estimateDispersions(estimateSizeFactors(ddsMatrix)), fitType = fitType, blind = blind))
}
hist(assay(ddsMatrix))
hist(assay(transformedCounts))
return(transformedCounts)
}
## all genes : N = 34378
rlog.dds.Salmon.basic <- TransformCounts(dds.Salmon.basic.filtered, "rlog")
## Criteria : baseMean > 10 AND adjusted p-value < 0.05 AND |log2FoldChange| > 0.5
# replace NA values to FC:0 and padj:1, then select significant genes
res.Salmon.basic.Filter <- res.Salmon.basic %>% data.frame() %>%
dplyr::mutate_at(vars(contains("log2FoldChange")), ~replace(., is.na(.), 0)) %>%
dplyr::mutate_at(vars(contains("padj")), ~replace(., is.na(.), 1)) %>%
dplyr::filter(baseMean > mean.cutoff, padj < FDR, abs(log2FoldChange) > lfc.cutoff)
res.Salmon.basic.Filter <- res.Salmon.basic.Filter[order(res.Salmon.basic.Filter$padj),]
## only the significant genes : N = 831
rlog.dds.Salmon.basic.sig <- TransformCounts(dds.Salmon.basic.filtered[rownames(res.Salmon.basic.Filter),], "rlog")
dds.Salmon.basic.filtered[rownames(res.Salmon.basic.Filter),]
## class: DESeqDataSet
## dim: 1008 24
## metadata(1): version
## assays(8): counts avgTxLength ... replaceCounts replaceCooks
## rownames(1008): ENSG00000122406.14 ENSG00000117523.18 ...
## ENSG00000186567.14 ENSG00000289463.1
## rowData names(23): baseMean baseVar ... maxCooks replace
## colnames(24): CC_rep1_s1 CC_rep1_s2 ... CC_CR_rep3_s3 CC_CR_rep3_s4
## colData names(4): condition bio_rep tech_rep replaceable
Visualizing clusters with heatmap and dendogram
Using the count matrix with all genes, rlog transformation had been performed, and Euclidean distances among the samples are visualized as a heatmap and hierarchical clustering is performed using complete linkage method. Learn more about how to use pheatmap package: click here
## calculate distance
sampleDists <- dist(t(assay(rlog.dds.Salmon.basic)), method = "euclidean")
sampleDistMatrix <- as.matrix(sampleDists)
## annotation color
mat_col <- data.frame(Condition = rlog.dds.Salmon.basic$condition, Replicate = rlog.dds.Salmon.basic$tech_rep)
rownames(mat_col) <- colnames(sampleDistMatrix)
mat_colors <- list(Condition = color$Condition, Replicate = color$Replicate)
names(mat_colors$Condition) <- unique(mat_col$Condition)
names(mat_colors$Replicate) <- unique(mat_col$Replicate)
## heatmap
p <- ggplotify::as.ggplot(pheatmap(mat = sampleDistMatrix,
clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists,
clustering_method = "complete",
color = colorRampPalette(rev(brewer.pal(n = 9, name = "YlGnBu")))(255),
border_color = NA, cutree_cols=2, show_rownames = FALSE, fontsize = 15, #fontfamily = "Sans",
width = ncol(sampleDistMatrix)*unit(7, "mm"), height = nrow(sampleDistMatrix)*unit(7, "mm"),
cellwidth = 20, cellheight = 16,
annotation_col = mat_col, annotation_colors = mat_colors,
main = "Clustering heatmap using all genes"))
## png
## 2
Same as the plot above, but using the count matrix with only the significant genes.
## calculate distance
sampleDists <- dist(t(assay(rlog.dds.Salmon.basic.sig)), method = "euclidean")
sampleDistMatrix <- as.matrix(sampleDists)
## annotation color
mat_col <- data.frame(Condition = rlog.dds.Salmon.basic.sig$condition, Replicate = rlog.dds.Salmon.basic.sig$tech_rep)
rownames(mat_col) <- colnames(sampleDistMatrix)
mat_colors <- list(Condition = color$Condition, Replicate = color$Replicate)
names(mat_colors$Condition) <- unique(mat_col$Condition)
names(mat_colors$Replicate) <- unique(mat_col$Replicate)
## heatmap
p <- ggplotify::as.ggplot(pheatmap(mat = sampleDistMatrix,
clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists,
clustering_method = "complete",
color = colorRampPalette(rev(brewer.pal(n = 9, name = "YlGnBu")))(255),
border_color = NA, cutree_cols=2, show_rownames = FALSE, fontsize = 15, #fontfamily = "Sans",
width = ncol(sampleDistMatrix)*unit(7, "mm"), height = nrow(sampleDistMatrix)*unit(7, "mm"),
cellwidth = 20, cellheight = 16,
annotation_col = mat_col, annotation_colors = mat_colors,
main = "Clustering heatmap using Significant genes"))
## png
## 2
PCA and t-SNE plot
To make the large dataset of 34395 rows and 6 columns, dimensionality
reduction techniques like PCA (Principal component analysis) and t-SNE
(t-distributed stochastic neighbor embedding) are utilized for data
visualization. We can utilize the clustering results as quality control
measure of the experiment.
Recreation of Figure 2a
Scaling (z-transformation)
z_Transform <- function(rlog_dds) {
normalized.counts = SummarizedExperiment::assay(rlog_dds)
## gene-wise z-transformation
m = apply(normalized.counts, 1, mean)
s = apply(normalized.counts, 1, sd)
z = data.frame((normalized.counts - m)/s)
## add gene names
# gtf.data <- GenomicTools.fileHandler::importGTF(gtf.path, level="gene", features=c("gene_id", "gene_name", "gene_type"))
z$gene = gtf.data$gene_name[match(rownames(rlog_dds), gtf.data$gene_id)]
z$ensembl <- gsub('\\..+$', '', rownames(rlog_dds))
z$entrez <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys=z$ensembl, column="ENTREZID", keytype="ENSEMBL", multiVals = "first")
return(z)
}
z.counts <- z_Transform(rlog.dds.Salmon.basic)
## 'select()' returned 1:many mapping between keys and columns
write.csv(z.counts, file = paste0(outdir,"gene-wise-z-transformation.csv"))
Recreation of Figure 2a
Heatmap of the top 50 differentially expressed transcripts in CC-CR versus CC from three independent 3D culture experiments. Gene expression values are gene-wise z-transformed and are colored red for high abundance and blue for low abundance, as indicated in the scale bar.
## Top 50 genes from the original paper (fig 2a)
top50 = c("MIR100HG","HS3ST5","A2M","ZNF804A", "ARHGAP29", "SESN3", "SATB1", "SERPINE2", "AKAP12", "EPHB6", "GAS6", "IGFBP6", "CUEDC1", "CGN", "MESDC1", "VIL1", "SNTB1", "RP11-395P17.3", "ALPK3", "LOXL1", "CAPS", "FYN", "H19", "RASGRF1", "CDK6", "PKP1", "DYSF", "HLA-DRA", "ENC1", "QPCT", "IGF2BP2", "AC007952.5", "SPDYC", "HOXD10", "CIB2", "CFTR", "HOXD11", "TEX30", "LINC01018", "SERTAD4", "DKK3", 'GATA6', 'GSTM4', "CPVL", 'DKK1', 'EPB41L3', 'ZNK608', "CRABP1", "SERP2", "OLFM1")
z.counts.top50 <- z.counts[z.counts$gene %in% top50,]
z.counts.top50 <- z.counts.top50 %>% arrange(factor(gene, levels = top50))
rownames(z.counts.top50) <- z.counts.top50$gene
z.counts.top50$gene <- NULL
z.counts.top50$ensembl <- NULL
z.counts.top50$entrez <- NULL
## Heatmap using gene-wise z-transformed gene expression values
mat_col <- data.frame(sample = rownames(coldata),
Condition = coldata$condition)
rownames(mat_col) <- mat_col$sample
mat_col$sample <- NULL
p <- pheatmap(z.counts.top50, cluster_rows = FALSE,
border_color = NA, fontsize = 15,
width = ncol(z.counts.top50)*unit(7, "mm"), height = nrow(z.counts.top50)*unit(7, "mm"),
cellwidth = 20, cellheight = 16,
annotation_col = mat_col, annotation_colors = mat_colors,
main = "Top 50 differentially expressed genes in CC_CR vs. CC")
PlotCounts of MIR100HG gene
gene.interest.name = "MIR100HG"
gene.interest = rownames(resLFC.Salmon.basic[resLFC.Salmon.basic$gene == gene.interest.name,])
d <- plotCounts(dds.Salmon.basic.filtered, gene = gene.interest, intgroup = "condition", returnData = TRUE)
d$name <- rownames(d)
d$tech_rep <- SampleSheet$tech_rep
ggplot(d, aes(x=condition, y=count, color=tech_rep)) + scale_color_manual(values=color$Replicate) + scale_fill_manual(values=color$Condition) +
geom_point(position=position_jitter(w=0.1,h=0)) + geom_boxplot(aes(group = condition, fill= condition),alpha = 0.3) + #geom_text_repel(aes(label = name)) +
ggtitle(paste(gene.interest.name, "")) + Theme.YL
## png
## 2
Functional Analysis
Gene set enrichment analysis (GSEA)
Preranked
gene set enrichment analysis (GSEA) is a widely used method for
interpretation of gene expression data in terms of biological processes.
Utilizing (shrunken) log2 fold changes for all genes, see whether gene
sets for particular biological pathways are enriched among the positive
or negative fold changes.
MSigDB
Reactome Pathway
Database
Ranking LFCs
## Remove NAs
gseaDat <- resLFC.Salmon.basic[!is.na(resLFC.Salmon.basic$entrez),]
## Remove duplicated gene names -- 1) significant? 2) highest baseMean
length(gseaDat$entrez[duplicated(gseaDat$entrez)])
## [1] 83
## 1) significant?
gseaDat <- subset(gseaDat, abs(log2FoldChange) > lfc.cutoff & padj < FDR)
## No duplicates
length(gseaDat$entrez[duplicated(gseaDat$entrez)])
## [1] 0
## Rank all genes based on their shrunken fold change
geneList <- gseaDat$log2FoldChange
names(geneList) <- gseaDat$entrez
## Plot the ranked fold changes in descending order
geneList <- sort(geneList, decreasing = T)
p <- barplot(geneList, main = "Ranked fold changes of all genes in descending order")
# pdf(paste0(outdir,"plot_rank_LFC.pdf"), width=10, height=8)
# plot(p)
# dev.off()
MSigDB & clusterProfiler
## Choose a category from MSigDB
MSigDB <- msigdbr::msigdbr(species = "Homo sapiens", category = "H")
MSigDB <- MSigDB %>% as.data.frame %>% dplyr::select(gs_name, entrez_gene)
set.seed(20250319)
gsea <- clusterProfiler::GSEA(geneList, TERM2GENE = MSigDB,
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
pAdjustMethod = "BH")
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## significant gene set
gsea@result %>% dplyr::select(ID, NES, qvalue, setSize, rank) %>% dplyr::arrange(qvalue) %>% head()
## ID NES
## HALLMARK_INTERFERON_GAMMA_RESPONSE HALLMARK_INTERFERON_GAMMA_RESPONSE 2.683109
## HALLMARK_INTERFERON_ALPHA_RESPONSE HALLMARK_INTERFERON_ALPHA_RESPONSE 2.368676
## qvalue setSize rank
## HALLMARK_INTERFERON_GAMMA_RESPONSE 0.0004538009 25 147
## HALLMARK_INTERFERON_ALPHA_RESPONSE 0.0017829376 19 335
openxlsx::write.xlsx(gsea@result %>% dplyr::arrange(qvalue),
file = paste0(outdir, "/gsea.hallmark.xlsx"))
GeneRatio = count / setSize. ‘count’ is the number of genes that belong to a given gene-set, while ‘setSize’ is the total number of genes in the gene-set.
enrichplot::dotplot(gsea, showCategory = 10) +
ggplot2::scale_fill_gradient(low = "#132B43", high = "#56B1F7", limits = c(0, 1), name = "qvalue")
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
enrichplot::gseaplot(gsea, geneSetID = 1, title = gsea$Description[1])
enrichplot::gseaplot(gsea, geneSetID = 1, title = gsea$Description[2])
reactomePathways
## Pathway analysis
pathways <- fgsea::reactomePathways(names(geneList))
fgseaRes <- fgsea::fgsea(pathways, geneList, maxSize = 500)
## Top 10 pathways
head(fgseaRes[order(padj, -abs(NES)), ], n=10)[,c(1,3,5)]
## pathway padj ES
## <char> <num> <num>
## 1: Interferon alpha/beta signaling 0.3896538 0.5227749
## 2: Antiviral mechanism by IFN-stimulated genes 0.3896538 0.5275529
## 3: Platelet degranulation 0.3896538 0.6107831
## 4: Response to elevated platelet cytosolic Ca2+ 0.3896538 0.6107831
## 5: ISG15 antiviral mechanism 0.3896538 0.5944044
## 6: Gastrulation 0.3896538 -0.6505853
## 7: Disorders of transmembrane transporters 0.3896538 -0.6820418
## 8: FOXO-mediated transcription 0.3896538 0.5555556
## 9: RHO GTPase Effectors 0.3896538 -0.6323468
## 10: Crosslinking of collagen fibrils 0.3896538 -0.7974026
## Enrichment plot of the first pathway
pathway.interest = head(fgseaRes[order(padj, -abs(NES)), ], n=10)$pathway[1]
p <- fgsea::plotEnrichment(pathways[[pathway.interest]], geneList) + ggplot2::labs(title=pathway.interest) + Theme.YL
# pdf(paste0(outdir,"plot_pathway_", gsub('\\s|\\/', '_', pathway.interest), ".pdf"), width=10, height=8)
plot(p)
# dev.off()
GO enrichment analysis
Gene Ontology (GO) analysis is used to highlight biological processes. From the GOseq paper : “One simple, but extremely widely used, systems biology technique for highlighting biological processes is gene category over-representation analysis. In order to perform this analysis, genes are grouped into categories by some common biological property and then tested to find categories that are over represented amongst the differentially expressed genes. Gene Ontology (GO) categories are commonly used in this technique and there are many tools available for performing GO analysis. … In order to correct for selection bias in category testing, we propose the following three-step methodology. First, the genes that are significantly DE between conditions are identified. The GOseq method works with any procedure for identifying DE genes. Second, the likelihood of DE as a function of transcript length is quantified. This is obtained by fitting a monotonic function to DE versus transcript length data. Finally, the DE versus length function is incorporated into the statistical test of each category’s significance. This final step takes into account the lengths of the genes that make up each category.”
## Create a list of differentially expressed genes (1=DE, 0=Not DE)
genes <- as.integer(threshold)
names(genes) <- resLFC.Salmon.basic$ensembl
table(genes)
## genes
## 0 1
## 33702 676
## Fit the Probability Weighting Function (PWF)
gene_lengths <- SummarizedExperiment::assays(dds.Salmon.basic.filtered)[["avgTxLength"]]
gene_lengths$median <- apply(gene_lengths, 1, median)
pwf <- goseq::nullp(genes, "hg38", "ensGene", bias.data = gene_lengths$median)
## Conduct GO enrichment analysis
goResults <- goseq::goseq(pwf, "hg38", "ensGene", test.cats=c("GO:BP"))
## Plot the top 10
GO_top10 <- goResults %>%
top_n(10, wt=-over_represented_pvalue) %>%
mutate(hitsPerc=numDEInCat*100/numInCat) %>%
ggplot(aes(x=hitsPerc,
y=term,
colour=over_represented_pvalue,
size=numDEInCat)) +
geom_point() +
expand_limits(x=0) +
labs(x="Hits (%)", y="GO term", colour="p value", size="Count", title = "GO enrichment (Top 10)") + Theme.YL
# pdf(paste0(outdir,"plot_GO_top10.pdf"), width=12, height=10)
plot(GO_top10)
# dev.off()
Further Analysis
Qiagen Ingenuity Pathway Analysis (IPA) :
Should you wish to perform the IPA analysis, contact the admin of the
software to create your account. You will get an email with activation
instruction upon approval. IPA appropriately formatted file can be found
at :
[working_directory]/[project_title]/[run_name]/DEAoutput/DEA_all_genes_IPA.txt
Please watch the following videos before using the software.
* Data Formatting in IPA : https://tv.qiagenbioinformatics.com/video/50301639/data-formatting-in-ipa
* Data Upload in IPA : https://tv.qiagenbioinformatics.com/video/50301463/data-upload-in-ipa
* Run Core Analysis & Interpret Results : https://tv.qiagenbioinformatics.com/video/64233123/making-your-rnaseq-microarray
Fusion gene analysis using Arriba :
Fusion genes can be detected by utilizing intermediate files of STAR run
as input files for Arriba. Upon completion of the RNA-seq pipeline, the
files can be found at
[working_directory]/[project_title]/[run_name]/result/star_salmon/[sample_name].Aligned.out.bam
When you install
Arriba using Bioconda, Blacklist, Known fusions, and Protein domains
of GRCh38 (and others) are included in your conda environment. The files
can be found at
[path/to/your/conda/environment]/var/lib/arriba/blacklist_hg38_GRCh38_v2.4.0.tsv.gz
Literature
Literature :
* STAR for read alignment of RNA-seq data:
Alexander Dobin and others, STAR: ultrafast universal RNA-seq aligner,
Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635
https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
* Salmon for transcript quantification from RNA-seq data:
Patro, R., Duggal, G., Love, M. et al. Salmon provides fast and
bias-aware quantification of transcript expression. Nat Methods 14,
417–419 (2017). https://doi.org/10.1038/nmeth.4197
https://combine-lab.github.io/salmon/
* DESeq2 for differential expression analysis based on the negative
binomial distribution:
Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold
change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550
(2014). https://doi.org/10.1186/s13059-014-0550-8
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#standard-workflow
* DESeq2 shrinkage method ‘ashr’:
Matthew Stephens, False discovery rates: a new deal, Biostatistics,
Volume 18, Issue 2, April 2017, Pages 275–294, https://doi.org/10.1093/biostatistics/kxw041
* DESeq2 shrinkage method ‘apeglm’:
Anqi Zhu, Joseph G Ibrahim, Michael I Love, Heavy-tailed prior
distributions for sequence count data: removing the noise and preserving
large differences, Bioinformatics, Volume 35, Issue 12, June 2019, Pages
2084–2092, https://doi.org/10.1093/bioinformatics/bty895
* GO enrichment analysis :
Young, M.D., Wakefield, M.J., Smyth, G.K. et al. Gene ontology analysis
for RNA-seq: accounting for selection bias. Genome Biol 11, R14 (2010).
https://doi.org/10.1186/gb-2010-11-2-r14
* Pathway analysis :
Fast gene set enrichment analysis, Gennady Korotkevich, Vladimir Sukhov,
Nikolay Budin, Boris Shpak, Maxim N. Artyomov, Alexey Sergushichev. doi:
https://doi.org/10.1101/060012
* Pathway database :
Marc Gillespie and others, The reactome pathway knowledgebase 2022,
Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages
D687–D692, https://doi.org/10.1093/nar/gkab1028
* Example mRNA-seq data :
Lu, Y., Zhao, X., Liu, Q. et al. lncRNA MIR100HG-derived miR-100 and
miR-125b mediate cetuximab resistance via Wnt/β-catenin signaling. Nat
Med 23, 1331–1341 (2017). https://doi.org/10.1038/nm.4424
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE82236
* Qiagen Ingenuity Pathway Analysis (IPA) :
Bioinformatics, Volume 30, Issue 4, February 2014, Pages 523–530, https://doi.org/10.1093/bioinformatics/btt703
* Arriba for fusion gene analysis :
Uhrig S, Ellermann J, Walther T, et al. Accurate and efficient detection
of gene fusions from RNA sequencing data. Genome Res.
2021;31(3):448-460. doi:10.1101/gr.257246.119
Session info
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
## [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
##
## time zone: Europe/Amsterdam
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] DT_0.33 org.Hs.eg.db_3.20.0
## [3] AnnotationDbi_1.68.0 GenomicTools.fileHandler_0.1.5.9
## [5] corrplot_0.95 data.table_1.16.4
## [7] fgsea_1.32.2 clusterProfiler_4.12.6
## [9] msigdbr_7.5.1 stringr_1.5.1
## [11] goseq_1.58.0 geneLenDataBase_1.42.0
## [13] BiasedUrn_2.0.12 ggplotify_0.1.2
## [15] plotly_4.10.4 ggrepel_0.9.6
## [17] plyr_1.8.9 tidyr_1.3.1
## [19] tximport_1.34.0 Rtsne_0.17
## [21] seriation_1.5.7 dendextend_1.19.0
## [23] ggvenn_0.1.10 gridExtra_2.3
## [25] RColorBrewer_1.1-3 pheatmap_1.0.12
## [27] xlsx_0.6.5 ashr_2.2-63
## [29] apeglm_1.28.0 IHW_1.34.0
## [31] DESeq2_1.46.0 SummarizedExperiment_1.36.0
## [33] Biobase_2.66.0 MatrixGenerics_1.18.1
## [35] matrixStats_1.5.0 GenomicRanges_1.58.0
## [37] GenomeInfoDb_1.42.3 IRanges_2.40.1
## [39] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [41] ggfortify_0.4.17 ggplot2_3.5.1
## [43] dplyr_1.1.4 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.2 BiocIO_1.16.0 bitops_1.0-9
## [4] filelock_1.0.3 R.oo_1.27.0 tibble_3.2.1
## [7] lpsymphony_1.34.0 XML_3.99-0.18 lifecycle_1.0.4
## [10] httr2_1.1.0 mixsqp_0.3-54 rprojroot_2.0.4
## [13] vroom_1.6.5 lattice_0.22-6 MASS_7.3-64
## [16] magrittr_2.0.3 openxlsx_4.2.8 sass_0.4.9
## [19] rmarkdown_2.29 jquerylib_0.1.4 yaml_2.3.10
## [22] ggtangle_0.0.6 zip_2.3.2 cowplot_1.1.3
## [25] DBI_1.2.3 abind_1.4-8 zlibbioc_1.52.0
## [28] R.utils_2.12.3 purrr_1.0.4 RCurl_1.98-1.16
## [31] yulab.utils_0.2.0 xlsxjars_0.6.1 rappdirs_0.3.3
## [34] GenomeInfoDbData_1.2.13 enrichplot_1.26.6 irlba_2.3.5.1
## [37] tidytree_0.4.6 reactome.db_1.89.0 codetools_0.2-20
## [40] DelayedArray_0.32.0 DOSE_4.0.0 xml2_1.3.6
## [43] tidyselect_1.2.1 aplot_0.2.4 farver_2.1.2
## [46] UCSC.utils_1.2.0 viridis_0.6.5 TSP_1.2-4
## [49] BiocFileCache_2.14.0 rmdformats_1.0.4 GenomicAlignments_1.42.0
## [52] jsonlite_1.8.9 survival_3.8-3 iterators_1.0.14
## [55] bbmle_1.0.25.1 foreach_1.5.2 tools_4.4.2
## [58] progress_1.2.3 treeio_1.30.0 Rcpp_1.0.14
## [61] glue_1.8.0 SparseArray_1.6.1 here_1.0.1
## [64] xfun_0.50 mgcv_1.9-1 qvalue_2.38.0
## [67] ca_0.71.1 withr_3.0.2 numDeriv_2016.8-1.1
## [70] fastmap_1.2.0 digest_0.6.37 truncnorm_1.0-9
## [73] R6_2.6.1 gridGraphics_0.5-1 colorspace_2.1-1
## [76] GO.db_3.20.0 biomaRt_2.62.1 RSQLite_2.3.9
## [79] R.methodsS3_1.8.2 generics_0.1.3 rtracklayer_1.66.0
## [82] prettyunits_1.2.0 httr_1.4.7 htmlwidgets_1.6.4
## [85] S4Arrays_1.6.0 pkgconfig_2.0.3 rJava_1.0-11
## [88] gtable_0.3.6 blob_1.2.4 registry_0.5-1
## [91] XVector_0.46.0 htmltools_0.5.8.1 bookdown_0.42
## [94] scales_1.3.0 png_0.1-8 ggfun_0.1.8
## [97] knitr_1.49 rstudioapi_0.17.1 tzdb_0.4.0
## [100] reshape2_1.4.4 rjson_0.2.23 coda_0.19-4.1
## [103] nlme_3.1-167 curl_6.2.0 bdsmatrix_1.3-7
## [106] cachem_1.1.0 parallel_4.4.2 restfulr_0.0.15
## [109] pillar_1.10.1 vctrs_0.6.5 slam_0.1-55
## [112] dbplyr_2.5.0 evaluate_1.0.3 readr_2.1.5
## [115] invgamma_1.1 GenomicFeatures_1.58.0 mvtnorm_1.3-3
## [118] cli_3.6.4 locfit_1.5-9.11 compiler_4.4.2
## [121] Rsamtools_2.22.0 rlang_1.1.5 crayon_1.5.3
## [124] SQUAREM_2021.1 labeling_0.4.3 fdrtool_1.2.18
## [127] emdbook_1.3.13 fs_1.6.5 stringi_1.8.4
## [130] viridisLite_0.4.2 BiocParallel_1.40.0 babelgene_22.9
## [133] txdbmaker_1.2.1 munsell_0.5.1 Biostrings_2.74.1
## [136] lazyeval_0.2.2 GOSemSim_2.32.0 Matrix_1.7-2
## [139] patchwork_1.3.0 hms_1.1.3 bit64_4.6.0-1
## [142] KEGGREST_1.46.0 igraph_2.1.4 memoise_2.0.1
## [145] snpStats_1.56.0 bslib_0.9.0 ggtree_3.14.0
## [148] fastmatch_1.1-6 bit_4.5.0.1 gson_0.1.0
## [151] ape_5.8-1